Commit Graph

  • 29694f4efe startup fixes Matt Wells 2014-03-08 10:25:56 -07:00
  • 8aa0662a27 Merge branch 'diffbot' into testing Matt Wells 2014-03-08 09:38:44 -07:00
  • 14817df7a9 new site patterns api stuff Matt Wells 2014-03-08 09:23:32 -07:00
  • 7cdd411ef1 Merge branch 'diffbot' into diffbot-testing Matt Wells 2014-03-07 09:26:47 -08:00
  • 72fab5b61e Do not end a crawl while urls are still being spidered because they might add more links to spiderdb when they finally complete. Matt Wells 2014-03-07 09:30:12 -08:00
  • dcd42e455e Merge branch 'diffbot' into diffbot-testing Matt Wells 2014-03-07 09:02:29 -08:00
  • c143ee1fba fix core when creating a new collection because we incremented m_numRecs but did not grow the ptr buffer. also added support for localgb.conf so we can use that instead of gb.conf to avoid git push/pull conflicts. Matt Wells 2014-03-07 09:05:14 -08:00
  • f777e6cccd Merge branch 'diffbot' into diffbot-testing Matt Wells 2014-03-07 08:23:21 -08:00
  • d6177019ec minor fix Matt Wells 2014-03-07 08:07:09 -08:00
  • 434dd182d4 fix mem leak. always harvest links for custom crawls. Matt Wells 2014-03-06 21:24:39 -08:00
  • e351d2a6f1 get searching on token working Matt Wells 2014-03-06 17:01:41 -08:00
  • 27e8e810d2 use collnum instead of coll string. more stable since resetting collections keeps string the same but changes the collnum. Matt Wells 2014-03-06 15:48:11 -08:00
  • d74f748e93 search all collections under a token if "&token" is given but not "&c=..." Matt Wells 2014-03-06 11:00:43 -08:00
  • 97e46dbf4e Merge branch 'diffbot-testing' of github.com:gigablast/open-source-search-engine into diffbot-testing Matt Wells 2014-03-06 10:45:45 -08:00
  • ca2d307229 revert gb.conf Matt Wells 2014-03-06 10:47:03 -08:00
  • efa92b16fd Merge branch 'diffbot-testing' of github.com:gigablast/open-source-search-engine into diffbot-testing Matt Wells 2014-03-06 10:45:35 -08:00
  • 25cf0efdbf first compiled stab at multi collection searching. Matt Wells 2014-03-06 10:45:13 -08:00
  • 451a092378 fix core from changing parms while evaluating a url. Matt Wells 2014-03-06 07:47:43 -08:00
  • 0962e243a4 Merge branch 'diffbot' into diffbot-testing Matt Wells 2014-03-05 07:43:25 -08:00
  • 58a1feeea5 specify &header=1 explicitly to get json serp header lest we break our clients' parsers Matt Wells 2014-03-05 07:41:59 -08:00
  • 13e33bc261 fix jezebel crawl from hanging. Matt Wells 2014-03-04 19:45:26 -08:00
  • 1b62f1582b print memtable when almost full so we can see where the leak is. more spiders for ethan. do not try to get diffbot reply if page is already json. likely it is an injected diffbot json reply. Matt Wells 2014-03-04 18:19:50 -08:00
  • 603cd67758 fix csv downloads some more Matt Wells 2014-03-04 12:07:46 -08:00
  • 2ab9aaeeaa streaming csv fixes Matt Wells 2014-03-04 11:04:26 -08:00
  • 866b09d25e Merge branch 'diffbot-testing' of github.com:gigablast/open-source-search-engine into diffbot-testing Matt Wells 2014-03-04 10:46:28 -08:00
  • b1381cc610 make csv streamable, faster and take almost no memory. Matt Wells 2014-03-04 10:45:57 -08:00
  • 280dcb85cf fix for passing testUpdatedContent smoketest Matt Wells 2014-03-04 09:09:51 -08:00
  • ab9f2b33c1 definition updates Matt Wells 2014-03-04 08:37:39 -07:00
  • 1acb16b1ee tweak empty doledb priority logic. anchor it more to m_doleIpTable for more reliability. seems like it was causing some slowdowns during spidering. seems more continuous now. Matt Wells 2014-03-03 13:48:59 -08:00
  • 48b5330d9c only skip checking to spider a url if its doleip table is empty Matt Wells 2014-03-03 13:22:27 -08:00
  • 282dad6cef deal with no coll recs when getting link text using msg25. do not share g_lineTable between collections. Matt Wells 2014-03-03 08:04:24 -08:00
  • ff8a0b4ef1 do not let all collections share the same line table in linkdb.cpp Matt Wells 2014-03-03 07:50:11 -08:00
  • a82abe8260 added ^ operator to url crawl patterns. good for tmz crawl. Matt Wells 2014-03-02 14:57:59 -08:00
  • 7fd6bbd7f5 added ^ support to url crawl expressions Matt Wells 2014-03-02 14:41:25 -08:00
  • e4d425c18f fix coll being deleted when getting link text. Matt Wells 2014-03-02 14:24:49 -08:00
  • bb5016e88b add the following fields to json search results: currentTimeUTC, responseTimeMS, docsInCollection, hits, moreResultsFollow, and docId. Changes structure of json so that now the results array is returned as an array within a dictionary (field name "results") as opposed to being the only object returned Daniel Steinberg 2014-03-01 11:16:17 -08:00
  • aeb2833d20 Merge branch 'diffbot-testing' of github.com:gigablast/open-source-search-engine into diffbot-testing Matt Wells 2014-02-28 11:46:44 -08:00
  • 11efab9862 Merge branch 'diffbot-testing' of github.com:gigablast/open-source-search-engine into diffbot-testing Matt Wells 2014-02-28 08:23:59 -08:00
  • c596d38e60 fix core from getting title of json object Matt Wells 2014-02-28 08:18:09 -08:00
  • 5f3aa24805 took out restrictDomain logic. now we always only follow links on the same domain as the seed UNLESS a url crawl pattern or a url crawl regex was specified. Matt Wells 2014-02-27 19:53:17 -08:00
  • 42f254125e fix core in new link text logic. empty msg25 replies are ok if g_errno is set. Matt Wells 2014-02-27 13:56:32 -08:00
  • 365fc16606 fix core in "wait in line" logic when getting link info in Linkdb.cpp. Matt Wells 2014-02-27 09:22:35 -08:00
  • af9eb8fb73 need to allow clients to not restrict to seed domains. Matt Wells 2014-02-26 22:27:22 -08:00
  • 927f4626ee Merge branch 'diffbot-testing' of github.com:gigablast/open-source-search-engine into diffbot-testing Matt Wells 2014-02-26 22:26:13 -08:00
  • eaca38cbfd fix new result streaming logic some more Matt Wells 2014-02-26 21:42:43 -08:00
  • 0933884191 fix super fast and mem efficient search results streaming code. Matt Wells 2014-02-26 21:18:08 -08:00
  • f11e25024a Merge branch 'diffbot' into diffbot-testing Matt Wells 2014-02-26 20:34:06 -08:00
  • 1030e6ada8 Merge branch 'diffbot-testing' of github.com:gigablast/open-source-search-engine into diffbot-testing Matt Wells 2014-02-26 20:30:20 -08:00
  • b429f12346 add logic to save memory when streaming over 200 results back. should fix oom when streaming back hundreds of thousands of results. Matt Wells 2014-02-26 20:33:35 -08:00
  • 8208178c79 remove "Initial crawl request" dups from the urls.csv. do not count fake firstip spider request attempts in xmldoc.cpp as crawlbot page download attempts since we just re-add that request with the correct firstip and bail. it basically doubles this count from what users would expect. Matt Wells 2014-02-26 15:48:52 -08:00
  • b450bfc2a6 do not show html column in csv. libreoffice and excel flub it if a cell is over 32k or so. Matt Wells 2014-02-26 15:03:05 -08:00
  • a0697e1bb5 do not allow custom crawls to spider the web any more. Matt Wells 2014-02-26 10:26:09 -08:00
  • 6445b0572b fix gb.conf Matt Wells 2014-02-26 01:08:20 -08:00
  • a6b7e088f5 take out tfndb, unused. fix core from diffbot url too long. Matt Wells 2014-02-26 01:07:13 -08:00
  • 6716d8f21b remove entry from linetable for linkinfo lookup Matt Wells 2014-02-26 00:27:29 -08:00
  • 8bb5d106db fixes for query reindex/delete. Matt Wells 2014-02-25 18:12:45 -08:00
  • 33c8123288 more fixes for new link info code. Matt Wells 2014-02-25 13:53:41 -08:00
  • 9c486c77ed Merge branch 'diffbot' into diffbot-testing Matt Wells 2014-02-25 12:32:40 -08:00
  • cf6695f625 speed up getNumTotalRecs() by caching it basically for 2 seconds since pingserver.cpp calls it all the time. Matt Wells 2014-02-25 12:14:51 -08:00
  • b3ff7df904 Merge branch 'diffbot' into diffbot-testing Matt Wells 2014-02-25 11:05:46 -08:00
  • b58d88c57f fix sections infinite loop bug. Matt Wells 2014-02-25 11:09:07 -08:00
  • 94a55bf9a6 fixes for new link info code so it doesn't bottleneck. got EFENCE_SIZE working so we can use efence on large allocs only so we don't go oom using it. might help finding some of the out of bounds writing going on. Matt Wells 2014-02-25 10:55:05 -08:00
  • ceb623bb8f do not dedup bulks. only respider urls if error is tmp. mess with msg1 in spider.cpp so niceness is MAX_NICENESS and not 0 because it was not able to trigger a doledb dump. Matt Wells 2014-02-23 20:04:46 -08:00
  • 72f1312652 new linkdb code compiling. Matt Wells 2014-02-20 17:27:28 -08:00
  • 9820f14066 checkpoint Matt Wells 2014-02-20 14:54:21 -08:00
  • 88dfa20cbe docid based spider rec related fixes. Matt Wells 2014-02-20 08:46:00 -08:00
  • e87b71caef fix query reindex core Matt Wells 2014-02-19 21:07:01 -08:00
  • b37b19ea4a print comma before json item so we do not end in trailing comma ever Matt Wells 2014-02-19 10:04:49 -08:00
  • dda7648333 try to fix problem of crawls stopping when they shouldn't. seems like it might be doing the trick. Matt Wells 2014-02-19 00:51:46 -08:00
  • b48adc0542 try to fix crawls stopping too early Matt Wells 2014-02-18 10:28:48 -08:00
  • ae2aed7066 try to fix a few cores from deleting collections. try to spider urls again if user changes certain crawling parms. like regex, patterns, etc. Matt Wells 2014-02-18 09:44:15 -08:00
  • 117f0ca4e8 more bio updates Matt Wells 2014-02-17 15:32:45 -07:00
  • bc2b9d6179 bio.html updates Matt Wells 2014-02-17 15:29:08 -07:00
  • f942183104 ignore maxtocrawl for bulk jobs too Matt Wells 2014-02-16 22:24:17 -08:00
  • a4deb7ff08 exempt bulk jobs from maxtoprocess Matt Wells 2014-02-16 22:14:43 -08:00
  • 9c9d5fff98 print out content type in caps with maroon bg in serps. use empty site patterns to mean no restriction, not "*" anymore for simplicity. Matt Wells 2014-02-16 22:47:02 -07:00
  • 0b5cd6d3f9 more parm fixes Matt Wells 2014-02-16 22:18:39 -07:00
  • 48315f6dc3 parm fixes Matt Wells 2014-02-16 22:13:27 -07:00
  • 725b6189a7 show user's ip in master ips description so they can add it to the list easily. Matt Wells 2014-02-16 21:56:31 -07:00
  • 9d0dca71db fix rapid coll delete bug some more. Matt Wells 2014-02-16 20:13:06 -08:00
  • def7822a22 multiple red boxes for clarity Matt Wells 2014-02-16 20:33:51 -07:00
  • ce652462b0 add color coded circles to coll nav bar. disk usage red box. Matt Wells 2014-02-16 19:59:53 -07:00
  • 88a151f1d9 Merge branch 'diffbot-testing' of github.com:gigablast/open-source-search-engine into diffbot-testing Matt Wells 2014-02-16 16:02:00 -08:00
  • c691b2dd5f hopcount precedence fix Matt Wells 2014-02-16 16:01:29 -08:00
  • a4b6716623 Merge branch 'diffbot-testing' of github.com:gigablast/open-source-search-engine into diffbot-testing Matt Wells 2014-02-16 15:57:46 -08:00
  • 29c9e49935 do not do link loop detection if doing a custom crawl. Matt Wells 2014-02-16 15:57:32 -08:00
  • f8135e628e fall back to hop count if priority is tied (and both are due to be spidered). defaults back to breadth first like it was doing before. Matt Wells 2014-02-16 15:52:08 -08:00
  • 0a4963f597 do not allow spot/seeds to be added to collnum being repaired or rebuilt. Matt Wells 2014-02-16 15:18:50 -08:00
  • fe63371622 Merge branch 'diffbot-testing' of github.com:gigablast/open-source-search-engine into diffbot-testing Matt Wells 2014-02-16 13:39:02 -08:00
  • 4930243de3 minor updates Matt Wells 2014-02-16 13:38:54 -08:00
  • fe0f2d3537 allow coll delete if not the one being repaired Matt Wells 2014-02-16 10:55:34 -08:00
  • 32526a9b25 more checksum fixes for json. fixes for repair/rebuild procedure. Matt Wells 2014-02-16 10:46:41 -08:00
  • df59d3946a fix content hash issues for json. do not hash url/resolved_url/html fields. do exact order-independent hashes of remaining field/value pairs. used for setting EDOCUNCHANGED and doing spidertime/querytime deduping. also do not index "html" json field because it is huge, slow and redundant. convert "date" field into a number so we can sort/constrain by article pub date. Matt Wells 2014-02-15 14:40:56 -08:00
  • 734ce1fc55 fix core from a high priority injection inserting records at the same time as a lower priority spider. Matt Wells 2014-02-14 10:51:02 -08:00
  • 3271f22995 Merge branch 'diffbot-testing' into diffbot Matt Wells 2014-02-13 11:25:59 -08:00
  • dc8b9090e8 fix out of alloc slots core Matt Wells 2014-02-13 11:21:39 -08:00
  • 08b103f3a4 Merge branch 'diffbot-testing' into diffbot Matt Wells 2014-02-13 10:11:56 -08:00
  • c3d8a143be fix bug of process regex being ignored when crawl regex was specified. Matt Wells 2014-02-13 10:06:14 -08:00
  • 4eee547391 do not do fuzzy deduping if &icc=1 (include cached copy) is true for search results. Matt Wells 2014-02-13 08:51:03 -08:00
  • cd6069e5a6 send single space to socket if not streaming and search results still not ready after 10 seconds. send it every 10 seconds to prevent client from closing socket. sped up all downloads, json and csv, by not doing "fuzzy" deduping of search results, but just deduping on page content hash. added TcpSocket::m_numDestroys to ensure we do not send heartbeat on a socket that was closed and re-opened for another client. Matt Wells 2014-02-13 08:45:13 -08:00
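
Commit df59d3946a in the list above describes an exact, order-independent hash over a JSON document's field/value pairs that skips the url, resolved_url and html fields. Commit cd6069e5a6 then relies on a page content hash for deduping. The sketch below illustrates that idea only. It is written in Python rather than the project's C++, and the helper name `content_hash`, the use of SHA-256, the 64-bit width and the restriction to flat objects are all assumptions made for illustration, not details taken from the repository.

```python
import hashlib

# Fields the commit message says are left out of the content hash.
SKIPPED_FIELDS = {"url", "resolved_url", "html"}


def content_hash(doc: dict) -> int:
    """Hypothetical helper: order-independent 64-bit hash of a flat JSON object."""
    total = 0
    for field, value in doc.items():
        if field in SKIPPED_FIELDS:
            continue
        # Hash each field/value pair on its own (assumed scheme, not the project's).
        pair = f"{field}\x00{value!r}".encode("utf-8")
        pair_hash = int.from_bytes(hashlib.sha256(pair).digest()[:8], "big")
        # Addition is commutative, so the order of the fields cannot change the sum.
        total = (total + pair_hash) % 2**64
    return total


first = {"title": "Example", "date": 1392503256, "url": "http://a.example/"}
second = {"url": "http://b.example/", "date": 1392503256, "title": "Example"}
print(content_hash(first) == content_hash(second))
```

Two documents that differ only in field order or in the skipped fields hash to the same value, which is the property an unchanged-content check or a dedup pass needs. Summing per-pair hashes is one simple way to get order independence; sorting the pairs before hashing them together would be another.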