Ignacio LP.

Extracting news from Google News.

Web scraping

Recently I needed to gather the news about a particular person, so I looked for libraries that extract news from the internet by keyword. After trying about five of them I came away disappointed, because none worked well. The best and most famous one was slow and, on top of that, returned truncated headlines. That is why I am sharing the small piece of code that solved this for me.

Be warned that this code may need changes later on, whenever Google News changes the shape of its response.

Cleaning the data

This solution follows the usual web scraping routine: collect the data, find patterns in the response, and then pick out the parts we want.

            
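using System;
using System.Collections.Generic;
using System.Linq;
using System.Net;
using System.Text;
using System.Text.RegularExpressions;
using HtmlAgilityPack;

public class NewsUnmarshal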
{
    private readonly HtmlDocument _htmlDocument;

    public NewsUnmarshal()
    {
        _htmlDocument = new HtmlDocument();
    }

    public IEnumerable<NewsDiv> Do(string html, string discriminatorTitle)
    {
        _htmlDocument.LoadHtml(html);

        // Locate the results container: the div whose id is "main".
        var main = _htmlDocument.DocumentNode.SelectSingleNode("//div[@id='main']");

        // SelectNodes returns null when nothing matches, so guard before enumerating.
        var divs = main?.SelectNodes("./div");

        if (divs == null)
        {
            yield break;
        }

        // The first child div is skipped; each remaining div is a candidate article.
        foreach (var div in divs.Skip(1))
        {
            // The article link is the <a> inside the first nested div.
            var childDivs = div.SelectSingleNode("./div")?.SelectSingleNode("./a");

            if (childDivs == null)
            {
                continue;
            }

            // The link's child divs hold the title first and the description second.
            var textNodes = childDivs.SelectNodes("./div");

            if (textNodes == null)
            {
                continue;
            }

            var arrayOfTexts = new List<string>();

            foreach (var childDiv in textNodes)
            {
                arrayOfTexts.Add(WebUtility.HtmlDecode(childDiv.InnerText));
            }

            if (arrayOfTexts.Count < 2)
            {
                continue;
            }

            if (arrayOfTexts.ElementAt(0).RemoveDiacritics()
                .Contains(discriminatorTitle, StringComparison.InvariantCultureIgnoreCase))
            {
                yield return new NewsDiv
                {
                    Title = (arrayOfTexts.ElementAt(0)),
                    Description = (arrayOfTexts.ElementAt(1)),
                    Href = childDivs.GetAttributeValue("href", string.Empty)
                };
            }
        }
    }

    public IEnumerable<NewsDiv> Do(byte[] html, string discriminatorTitle)
    {
        var htmlString = Encoding.UTF8.GetString(html);

        return Do(htmlString, discriminatorTitle);
    }
}

public class NewsDiv
{
    public string Title { get; set; }
    public string Origin { get; set; }
    private string _description;

    public string Description
    {
        get => _description;
        set
        {
            _description = value;
            TimeCreatedAt = GetTime(value);
        }
    }

    private DateTime GetTime(string time)
    {
        // Google shows relative ages in Spanish, such as "hace 3 días" or "hace 2 semanas".
        // Only the plural units listed below are recognized; any other text
        // (including singular forms like "hace 1 día") falls back to DateTime.Now.
        var regex = new Regex(@"hace (\d+) (días|semanas|meses|años|horas)");
        var match = regex.Match(time);

        if (match.Success)
        {
            var number = int.Parse(match.Groups[1].Value);
            var type = match.Groups[2].Value;

            return type switch
            {
                "días" => DateTime.Now.AddDays(-number),
                "semanas" => DateTime.Now.AddDays(-number * 7),
                "meses" => DateTime.Now.AddMonths(-number),
                "años" => DateTime.Now.AddYears(-number),
                "horas" => DateTime.Now.AddHours(-number),
                _ => throw new ArgumentOutOfRangeException(nameof(type), type, null)
            };
        }

        return DateTime.Now;
    }

    public DateTime TimeCreatedAt { get; set; }

    private string _href;

    public string Href
    {
        get => _href;
        set
        {
            // Result links come wrapped as "/url?q=<target>&..."; keep only the target.
            var reg = new Regex(@"/url\?q=(.*?)&");
            var match = reg.Match(value);

            if (match.Success)
            {
                _href = match.Groups[1].Value;
            }
            else
            {
                _href = value;
            }
        }
    }
}

Spaghetti code? Where?

Explanation

All it does is read the HTML fetched by the following code.


using System;
using System.Net.Http;
using System.Threading.Tasks;

public class NewsScraper
{
    private readonly HttpClient _client;

    public NewsScraper(HttpClient client)
    {
        _client = client;
    }

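    /// <summary>
    /// Downloads the Google News results page for the given search term.
    /// </summary>
    /// <param name="name">Name of the person the articles should be about.</param>
    /// <param name="start">Number of articles to skip; results come in pages of ten.</param>
    /// <returns>The raw HTML of the results page.</returns>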
    public async Task<string> GetTo(string name, int start = 0)
    {
        var url = MakeUrl(name, start);

        Console.WriteLine(url);

        var response = await _client.GetAsync(url);

        var content = await response.Content.ReadAsStringAsync();

        return content;
    }

    private string MakeUrl(string searchKey, int start = 0) =>
        $"https://www.google.com/search?q={searchKey.Replace(" ", "+")}&tbm=nws&start={start}";
}
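
MakeUrl only replaces spaces, so a search term containing characters such as & or # would break the query string. Below is a minimal sketch of a safer variant, assuming the same q, tbm and start parameters used above. The NewsUrl class and its Build method are hypothetical names of mine, and note that Uri.EscapeDataString encodes a space as %20 rather than +.

```csharp
using System;

public static class NewsUrl
{
    // Hypothetical helper: builds the same search URL as MakeUrl,
    // but percent-encodes the whole search term.
    public static string Build(string searchKey, int start = 0) =>
        $"https://www.google.com/search?q={Uri.EscapeDataString(searchKey)}&tbm=nws&start={start}";
}
```

For example, NewsUrl.Build("a b&c", 10) keeps the ampersand inside the search term instead of letting it start a new parameter.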

The HTML obtained is then passed to the Do method of the previous class. You must also give it a second string, the title discriminator: if that word is not found in a title, the code skips to the next article.

As for the parsing, it only extracts the main div. Then, for each div inside it, it takes the link in the first nested div and reads the InnerText of the link's child divs, which yields two strings: the first holds the title and the second holds the content.
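
The title check relies on a RemoveDiacritics extension method that is not shown in this post. Here is a minimal sketch of one possible implementation, assuming it is meant to strip accents through Unicode normalization. The StringExtensions class name is my own, and the original helper may behave differently.

```csharp
using System.Globalization;
using System.Linq;
using System.Text;

public static class StringExtensions
{
    // Assumed implementation: the original extension is not part of the post.
    public static string RemoveDiacritics(this string text)
    {
        // FormD splits a letter such as "é" into "e" plus a combining accent mark.
        var decomposed = text.Normalize(NormalizationForm.FormD);

        // Drop the combining marks and keep every other character.
        var stripped = decomposed
            .Where(c => CharUnicodeInfo.GetUnicodeCategory(c) != UnicodeCategory.NonSpacingMark)
            .ToArray();

        // Recompose what is left into the usual precomposed form.
        return new string(stripped).Normalize(NormalizationForm.FormC);
    }
}
```

With this version "José Pérez" becomes "Jose Perez". Only the title is normalized in Do, so the discriminator should be passed without accents.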

Once both texts are obtained, the only check is that the title contains the keyword. The last part of the description text is extracted and converted to a DateTime, which regex makes very simple. And that's it: we now have an object with the title, description, date and href.
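
The date conversion can also be tried on its own. The sketch below is a variant of GetTime, not the code used above: it assumes the same "hace N unidad" wording, also accepts singular units such as "hace 1 día", and takes the reference time as a parameter. RelativeTime and Parse are hypothetical names.

```csharp
using System;
using System.Text.RegularExpressions;

public static class RelativeTime
{
    // Accepts singular and plural units: día(s), semana(s), mes(es), año(s), hora(s).
    private static readonly Regex Pattern = new Regex(
        @"hace (\d+) (día|semana|mes|año|hora)(?:e?s)?\b",
        RegexOptions.IgnoreCase);

    // Hypothetical helper: returns null when the text has no recognizable age,
    // instead of falling back to the current time.
    public static DateTime? Parse(string text, DateTime now)
    {
        var match = Pattern.Match(text);

        if (!match.Success)
        {
            return null;
        }

        var amount = int.Parse(match.Groups[1].Value);

        return match.Groups[2].Value.ToLowerInvariant() switch
        {
            "día" => now.AddDays(-amount),
            "semana" => now.AddDays(-amount * 7),
            "mes" => now.AddMonths(-amount),
            "año" => now.AddYears(-amount),
            "hora" => now.AddHours(-amount),
            _ => (DateTime?)null
        };
    }
}
```

Passing now as a parameter keeps the function deterministic, so it can be checked against a fixed date instead of the system clock.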